Add 'crm sbd' sub-level (jsc#PED-8256) #1491

Open: wants to merge 32 commits into master from 20240614_crm_sbd_sublevel
Conversation

@liangxin1300 liangxin1300 commented Jul 17, 2024

Motivation

The main configuration for sbd use cases is scattered across sysconfig,
the on-disk metadata, and the CIB, and can even involve other OS
components, e.g. coredump, SCSI, and multipath.

It's desirable to reduce the management complexity among them and to
streamline the workflow for the main use case scenarios.

Changes include

Disk-based SBD scenarios

  1. Show usage on syntax errors
  2. Completion
  3. Display SBD related configuration (UC4 in PED-8256)
  4. Change the on-disk metadata of the existing sbd disks (UC2.1 in PED-8256)
  5. Add an sbd disk with the existing sbd configuration (UC2.2 in PED-8256)
  6. Remove an sbd disk (UC2.3 in PED-8256)
  7. Purge sbd from cluster
  8. Replace the storage for an sbd disk (UC2.4 in PED-8256)
  9. Display status (runtime information only) (UC5 in PED-8256)

Disk-less SBD scenarios

  1. Show usage on syntax errors (diskless)
  2. Completion (diskless)
  3. Display SBD related configuration (UC4 in PED-8256, diskless)
  4. Manipulate the basic diskless sbd configuration (UC3.1 in PED-8256)

@liangxin1300 force-pushed the 20240614_crm_sbd_sublevel branch 2 times, most recently from 338ed50 to 2f10c6e (July 17, 2024 13:52)

liangxin1300 commented Jul 18, 2024

Disk-based SBD scenarios

1. Show usage on syntax errors

# crm sbd configure xx
ERROR: Invalid argument: xx
Usage:
crm sbd configure show [disk_metadata|sysconfig|property]
crm sbd configure [watchdog-timeout=<integer>] [allocate-timeout=<integer>] [loop-timeout=<integer>] [msgwait-timeout=<integer>] [watchdog-device=<device>]

More syntax error cases:
See https://github.com/liangxin1300/crmsh/blob/20240614_crm_sbd_sublevel/test/features/sbd_ui.feature
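The usage check above can be sketched as a small key=value parser. This is an illustrative sketch only, not the actual crmsh implementation; the function and constant names are invented for the example:

```python
# Hypothetical sketch of the argument validation behind "crm sbd configure".
# VALID_KEYS mirrors the options shown in the usage text above.
VALID_KEYS = {
    "watchdog-timeout", "allocate-timeout", "loop-timeout",
    "msgwait-timeout", "watchdog-device",
}

def parse_configure_args(args):
    """Parse key=value arguments, rejecting anything outside VALID_KEYS."""
    parsed = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or key not in VALID_KEYS:
            # Matches the "ERROR: Invalid argument: xx" behaviour shown above
            raise ValueError(f"Invalid argument: {arg}")
        parsed[key] = value
    return parsed
```

For example, parse_configure_args(["xx"]) raises "Invalid argument: xx", matching the error message in the transcript above.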

2. Completion

# crm sbd 
cd          configure   device      disable     help        ls          quit        status      up 

# crm sbd configure 
allocate-timeout=  msgwait-timeout=   watchdog-device=   
loop-timeout=      show               watchdog-timeout=  

# crm sbd configure show 
disk_metadata   property        sysconfig

3. Display SBD related configuration (UC4 in PED-8256)

# crm sbd configure show disk_metadata 
INFO: crm sbd configure show disk_metadata
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : a4f4d842-278c-485d-ada6-7781d88bd632
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda5 is dumped

# crm sbd configure show sysconfig 
INFO: crm sbd configure show sysconfig
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=71
SBD_WATCHDOG_DEV=/dev/watchdog0
SBD_WATCHDOG_TIMEOUT=15
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_SYNC_RESOURCE_STARTUP=yes
SBD_OPTS=
SBD_DEVICE=/dev/sda5

# crm sbd configure show property 
INFO: crm sbd configure show property
pcmk_delay_max=30s
have-watchdog=true
stonith-enabled=true
stonith-timeout=83
priority-fencing-delay=60

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=90

4. Change the on-disk metadata of the existing sbd disks (UC2.1 in PED-8256)

# crm sbd configure watchdog-timeout=30
WARNING: It's recommended to set msgwait timeout >= 2*watchdog timeout
INFO: Initializing SBD device /dev/sda5
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 30
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: "priority-fencing-delay" in crm_config is set to 60, it was 0

# crm sbd configure msgwait-timeout=60
INFO: Initializing SBD device /dev/sda5
WARNING: Resource is running, need to restart cluster service manually on each node
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 101
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: "stonith-timeout" in crm_config is set to 119, it was 83

# crm sbd configure show disk_metadata 
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : 15b0a922-ab1b-4abd-b1d1-ab712a12a1ec
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 30
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 60
==Header on disk /dev/sda5 is dumped

# crm sbd configure watchdog-timeout=15
INFO: Initializing SBD device /dev/sda5
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 15
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: Resource is running, need to restart cluster service manually on each node

# crm sbd configure show disk_metadata 
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : d3918be3-8f51-4ca8-aea7-2d3dabdb7fa2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 60
==Header on disk /dev/sda5 is dumped
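The WARNING in the first command above reflects the recommended relation between the two timeouts. A minimal sketch of that check (illustrative names, not crmsh's actual code):

```python
def check_sbd_timeouts(watchdog_timeout: int, msgwait_timeout: int) -> list:
    """Return warnings for the recommended msgwait >= 2*watchdog relation."""
    warnings = []
    if msgwait_timeout < 2 * watchdog_timeout:
        warnings.append(
            "It's recommended to set msgwait timeout >= 2*watchdog timeout"
        )
    return warnings
```

In the transcript above, setting watchdog-timeout=30 against the existing msgwait of 30 triggers the warning; the follow-up msgwait-timeout=60 satisfies the relation again.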

5. Add an sbd disk with the existing sbd configuration (UC2.2 in PED-8256)

# crm sbd configure show disk_metadata 
INFO: crm sbd configure show disk_metadata
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : eccc76d1-1930-4437-8cb3-2726ca0ac293
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda5 is dumped

# crm sbd configure show sysconfig |grep DEVICE
SBD_DEVICE=/dev/sda5

# crm sbd device add /dev/sda6
INFO: Configured sbd devices: /dev/sda5
INFO: Append devices: /dev/sda6
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda6
INFO: Update SBD_DEVICE in /etc/sysconfig/sbd: /dev/sda5;/dev/sda6
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: Resource is running, need to restart cluster service manually on each node

# crm sbd configure show disk_metadata 
INFO: crm sbd configure show disk_metadata
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : eccc76d1-1930-4437-8cb3-2726ca0ac293
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda5 is dumped

==Dumping header on disk /dev/sda6
Header version     : 2.1
UUID               : 50dbd68b-dcab-4280-b8c4-3af2070acfba
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda6 is dumped

# crm sbd configure show sysconfig |grep DEVICE
SBD_DEVICE="/dev/sda5;/dev/sda6"

6. Remove an sbd disk (UC2.3 in PED-8256)

# crm sbd configure show sysconfig |grep DEVICE
SBD_DEVICE="/dev/sda5;/dev/sda6"

# crm sbd device remove /dev/sda
/dev/sda5   /dev/sda6   

# crm sbd device remove /dev/sda6
INFO: Configured sbd devices: /dev/sda5;/dev/sda6
INFO: Remove devices: /dev/sda6
INFO: Update SBD_DEVICE in /etc/sysconfig/sbd: /dev/sda5
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Requires to restart cluster service to take effect

# crm sbd configure show sysconfig |grep DEVICE
SBD_DEVICE=/dev/sda5
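The removal path keeps the remaining devices in their configured order and refuses to remove the last one. A hedged sketch of that logic (names invented for the example):

```python
def remove_sbd_devices(configured, to_remove):
    """Drop devices from SBD_DEVICE, preserving order; keep at least one."""
    left = [dev for dev in configured if dev not in to_remove]
    if not left:
        # Removing every device would silently leave the cluster without
        # disk-based fencing, so it is rejected outright.
        raise ValueError("Not allowed to remove all devices")
    return left
```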

7. Purge sbd from cluster

# crm sbd purge 
INFO: Stop sbd resource 'stonith-sbd'(stonith:fence_sbd)
INFO: Remove sbd resource 'stonith-sbd'
INFO: Disable sbd.service on node alp-1
INFO: Disable sbd.service on node alp-2
INFO: Move /etc/sysconfig/sbd to /etc/sysconfig/sbd.bak on all nodes
INFO: Delete cluster property "stonith-timeout" in crm_config
INFO: Delete cluster property "priority-fencing-delay" in crm_config
WARNING: "stonith-enabled" in crm_config is set to false, it was true
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
..........
INFO: END Waiting for cluster

8. Replace the storage for an sbd disk (UC2.4 in PED-8256)

# crm sbd configure show disk_metadata 
INFO: crm sbd configure show disk_metadata
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : a14c7997-b4e8-4490-aec2-3cd0e5746126
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda5 is dumped

==Dumping header on disk /dev/sda6
Header version     : 2.1
UUID               : 9a7fd3fd-71ad-4222-8ff2-e171ed9b776c
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda6 is dumped

# crm sbd device /dev/sda6
ERROR: Invalid argument: /dev/sda6
INFO: Usage: crm sbd device <add|remove> <device>...

# crm sbd device remove /dev/sda6
INFO: Configured sbd devices: /dev/sda5;/dev/sda6
INFO: Remove devices: /dev/sda6
INFO: Update SBD_DEVICE in /etc/sysconfig/sbd: /dev/sda5
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Requires to restart cluster service to take effect

# crm sbd device add /dev/sda9
INFO: Configured sbd devices: /dev/sda5
INFO: Append devices: /dev/sda9
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda9
INFO: Update SBD_DEVICE in /etc/sysconfig/sbd: /dev/sda5;/dev/sda9
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: Resource is running, need to restart cluster service manually on each node

# crm sbd configure show disk_metadata 
INFO: crm sbd configure show disk_metadata
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : a14c7997-b4e8-4490-aec2-3cd0e5746126
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda5 is dumped

==Dumping header on disk /dev/sda9
Header version     : 2.1
UUID               : ad807485-70cd-4216-8a03-26197de0878a
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda9 is dumped

# crm cluster restart --all
INFO: The cluster stack stopped on alp-1
INFO: The cluster stack stopped on alp-2
INFO: The cluster stack started on alp-1
INFO: The cluster stack started on alp-2

# ps -ef|grep sbd
root        3578       1  0 13:53 ?        00:00:00 sbd: inquisitor
root        3579    3578  0 13:53 ?        00:00:00 sbd: watcher: /dev/sda5 - slot: 0 - uuid: a14c7997-b4e8-4490-aec2-3cd0e5746126
root        3580    3578  0 13:53 ?        00:00:00 sbd: watcher: /dev/sda9 - slot: 0 - uuid: ad807485-70cd-4216-8a03-26197de0878a
root        3581    3578  0 13:53 ?        00:00:00 sbd: watcher: Pacemaker
root        3582    3578  0 13:53 ?        00:00:00 sbd: watcher: Cluster

9. Display status (runtime information only) (UC5 in PED-8256)

# crm sbd status
# Type of SBD:
Disk-based SBD configured

# Status of sbd.service:
Node   |Active  |Enabled |Since
alp-1  |YES     |YES     |active since: Tue 2024-10-22 13:53:31 CST
alp-2  |YES     |YES     |active since: Tue 2024-10-22 13:53:31 CST

# Watchdog info:
Node   |Device          |Driver    |Kernel Timeout
alp-1  |/dev/watchdog0  |iTCO_wdt  |10
alp-2  |/dev/watchdog0  |iTCO_wdt  |10

# Status of fence_sbd:
resource stonith-sbd is running on: alp-1

10. Overwrite cases

  • Added device has the same metadata as the configured devices
# crm sbd device add /dev/sda7
INFO: Configured sbd devices: /dev/sda5;/dev/sda6
/dev/sda7 has already been initialized by SBD - overwrite (y/n)? n
INFO: Append devices: /dev/sda7
INFO: Update SBD_DEVICE in /etc/sysconfig/sbd: /dev/sda5;/dev/sda6;/dev/sda7
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: Resource is running, need to restart cluster service manually on each node
  • Overwrite added device
# crm sbd device add /dev/sda7
INFO: Configured sbd devices: /dev/sda5;/dev/sda6
/dev/sda7 has already been initialized by SBD - overwrite (y/n)? y
INFO: Append devices: /dev/sda7
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda7
INFO: Update SBD_DEVICE in /etc/sysconfig/sbd: /dev/sda5;/dev/sda6;/dev/sda7
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
INFO: END Waiting for cluster
  • Added device has different metadata from the configured devices
alp-1:~ # crm sbd device add /dev/sda7
INFO: Configured sbd devices: /dev/sda5;/dev/sda6
/dev/sda7 has already been initialized by SBD - overwrite (y/n)? n
WARNING: Device /dev/sda7 doesn't have the same metadata as /dev/sda5
  • Overwrite device via crm cluster init, interactive mode
# crm cluster init sbd
...
Do you wish to use SBD (y/n)? y
SBD_DEVICE in /etc/sysconfig/sbd is already configured to use '/dev/sda5;/dev/sda6' - overwrite (y/n)? y
Path to storage device (e.g. /dev/disk/by-id/...), or "none" for diskless sbd, use ";" as separator for multi path []/dev/sda6
/dev/sda6 has already been initialized by SBD - overwrite (y/n)? y
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda6
INFO: Update SBD_DEVICE in /etc/sysconfig/sbd: /dev/sda6
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 15
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Enable sbd.service on node alp-1
INFO: Enable sbd.service on node alp-2
INFO: Restarting cluster service
...
  • No overwrite, with different metadata
# crm cluster init sbd
...
Do you wish to use SBD (y/n)? y
SBD_DEVICE in /etc/sysconfig/sbd is already configured to use '/dev/sda5;/dev/sda6' - overwrite (y/n)? y
Path to storage device (e.g. /dev/disk/by-id/...), or "none" for diskless sbd, use ";" as separator for multi path []/dev/sda6;/dev/sda7
/dev/sda6 has already been initialized by SBD - overwrite (y/n)? n
/dev/sda7 has already been initialized by SBD - overwrite (y/n)? n
WARNING: Device /dev/sda7 doesn't have the same metadata as /dev/sda6
  • Partly overwrite, use the first device's metadata
 # crm cluster init sbd
...
Do you wish to use SBD (y/n)? y
SBD_DEVICE in /etc/sysconfig/sbd is already configured to use '/dev/sda6;/dev/sda7' - overwrite (y/n)? y
Path to storage device (e.g. /dev/disk/by-id/...), or "none" for diskless sbd, use ";" as separator for multi path []/dev/sda5;/dev/sda7
/dev/sda5 has already been initialized by SBD - overwrite (y/n)? n
/dev/sda7 has already been initialized by SBD - overwrite (y/n)? y
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda7
INFO: Update SBD_DEVICE in /etc/sysconfig/sbd: /dev/sda5;/dev/sda7
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Enable sbd.service on node alp-1
INFO: Enable sbd.service on node alp-2
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
INFO: END Waiting for cluster
WARNING: "stonith-enabled" in crm_config is set to true, it was false
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 71
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: "stonith-timeout" in crm_config is set to 83, it was 60s
WARNING: "priority-fencing-delay" in crm_config is set to 60, it was 0
INFO: Done (log saved to /var/log/crmsh/crmsh.log on alp-1)

# crm sbd configure show disk_metadata
INFO: crm sbd configure show disk_metadata
==Dumping header on disk /dev/sda5
Header version     : 2.1
UUID               : a02ade38-169c-4c03-bb1b-8bade3126fe8
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda5 is dumped

==Dumping header on disk /dev/sda7
Header version     : 2.1
UUID               : a2ebd6a0-ab95-40d6-9132-104d96195fc7
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 15
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 30
==Header on disk /dev/sda7 is dumped
  • Do not overwrite the sysconfig
# crm cluster init
...
Do you wish to use SBD (y/n)? y
SBD_DEVICE in /etc/sysconfig/sbd is already configured to use '/dev/sda5;/dev/sda7' - overwrite (y/n)? n
WARNING: Hawk not installed - not configuring web management interface.
INFO: BEGIN Waiting for cluster
............

# crm sbd status
# Type of SBD:
Disk-based SBD configured

# Status of sbd.service:
Node   |Active  |Enabled |Since
alp-1  |YES     |YES     |active since: Wed 2024-10-23 09:36:31 CST

# Watchdog info:
Node   |Device          |Driver    |Kernel Timeout
alp-1  |/dev/watchdog0  |iTCO_wdt  |10

# Status of fence_sbd:
resource stonith-sbd is running on: alp-1
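The overwrite prompts above hinge on whether a candidate device's on-disk metadata matches the already-configured devices. A sketch of such a comparison, assuming the dumped header is parsed into a dict with the field names shown in the dumps above (the per-device UUID is excluded, since it differs by design):

```python
def has_same_metadata(header_a: dict, header_b: dict) -> bool:
    """Compare two dumped SBD headers, ignoring the per-device UUID."""
    keys = (set(header_a) | set(header_b)) - {"UUID"}
    return all(header_a.get(k) == header_b.get(k) for k in keys)
```

This mirrors the warning "Device /dev/sda7 doesn't have the same metadata as /dev/sda5" when the comparison fails.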

Disk-less SBD scenarios

1. Show usage on syntax errors (diskless)

# crm sbd configure xx
ERROR: Invalid argument: xx
Usage:
crm sbd configure show [sysconfig|property]
crm sbd configure [watchdog-timeout=<integer>] [watchdog-device=<device>]

2. Completion (diskless)

# crm sbd configure 
show               watchdog-device=   watchdog-timeout=  
# crm sbd configure show 
property    sysconfig

3. Display SBD related configuration (UC4 in PED-8256, diskless)

# crm sbd configure show
INFO: crm sbd configure show sysconfig
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=41
SBD_WATCHDOG_DEV=/dev/watchdog0
SBD_WATCHDOG_TIMEOUT=15
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_SYNC_RESOURCE_STARTUP=yes
SBD_OPTS=

INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true
stonith-watchdog-timeout=-1
stonith-timeout=71

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=90

4. Manipulate the basic diskless sbd configuration (UC3.1 in PED-8256)

# crm sbd configure watchdog-timeout=31
INFO: Configuring diskless SBD
WARNING: Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for 2-node cluster, should be combined with QDevice.
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 31
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
...........
INFO: END Waiting for cluster
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 73
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: "stonith-timeout" in crm_config is set to 85, it was 71

# crm sbd configure show
INFO: crm sbd configure show sysconfig
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=73
SBD_WATCHDOG_DEV=/dev/watchdog0
SBD_WATCHDOG_TIMEOUT=31
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_SYNC_RESOURCE_STARTUP=yes
SBD_OPTS=

INFO: crm sbd configure show property
have-watchdog=true
stonith-enabled=true
stonith-watchdog-timeout=-1
stonith-timeout=85

INFO: systemctl show -p TimeoutStartUSec sbd.service --value
TimeoutStartUSec=90

5. Remove diskless sbd from cluster

# crm sbd disable 
INFO: Disable sbd.service on node alp-1
INFO: Disable sbd.service on node alp-2
INFO: Delete cluster property "stonith-watchdog-timeout" in crm_config
INFO: Delete cluster property "stonith-timeout" in crm_config
WARNING: "stonith-enabled" in crm_config is set to false, it was true
INFO: Requires to restart cluster service to take effect

# ps -ef|grep sbd
root        3418       1  0 08:43 ?        00:00:00 sbd: inquisitor
root        3420    3418  0 08:43 ?        00:00:00 sbd: watcher: Pacemaker
root        3421    3418  0 08:43 ?        00:00:00 sbd: watcher: Cluster
root        3665    1697  0 08:45 pts/0    00:00:00 grep --color=auto sbd

# crm cluster restart --all
INFO: The cluster stack stopped on alp-1
INFO: The cluster stack stopped on alp-2
INFO: The cluster stack started on alp-1
INFO: The cluster stack started on alp-2

# ps -ef|grep sbd
root        3752    1697  0 08:45 pts/0    00:00:00 grep --color=auto sbd

@liangxin1300 force-pushed the 20240614_crm_sbd_sublevel branch 9 times, most recently from a19a863 to cc0d52a (July 23, 2024 02:07)
@liangxin1300 force-pushed the 20240614_crm_sbd_sublevel branch 9 times, most recently from 1456931 to e8f53af (August 1, 2024 03:16)
@liangxin1300 force-pushed the 20240614_crm_sbd_sublevel branch 2 times, most recently from bc2a1fa to 229de46 (August 2, 2024 02:50)
@liangxin1300 force-pushed the 20240614_crm_sbd_sublevel branch 6 times, most recently from 77c1c4f to 5d17668 (August 20, 2024 02:07)
After adding the sbd device interface to manage devices, related
functionality inside the sbd configure interface should be adjusted
to make sure the metadata is consistent between devices.
Add a log message to indicate the start of pacemaker.service.
This helps users understand that the system is not hanging but
is actually starting pacemaker, especially when SBD_DELAY_START
is set and it takes longer to start pacemaker.
And the `sbd purge` command will also move /etc/sysconfig/sbd to
/etc/sysconfig/sbd.bak on all nodes.
@liangxin1300 force-pushed the 20240614_crm_sbd_sublevel branch 2 times, most recently from 774ea69 to 79535f6 (November 25, 2024 10:10)

liangxin1300 commented Nov 29, 2024

Added output of the sbd processes to sbd status:

# crm sbd status
# Type of SBD:
Disk-based SBD configured

# Status of sbd.service:
Node   |Active  |Enabled |Since
alp-1  |YES     |YES     |active since: Fri 2024-11-29 14:46:19 CST
alp-2  |YES     |YES     |active since: Fri 2024-11-29 14:46:19 CST

# Status of sbd process on alp-1:
├─10675 sbd: watcher: /dev/sda5 - slot: 1 - uuid: 8c0e6bbd-d067-4d7e-9531-237da2490799
├─10676 sbd: watcher: /dev/sda6 - slot: 0 - uuid: 0a48b7a3-f9cc-46e6-89a4-e1f1215bdd14
├─10677 sbd: watcher: /dev/sda7 - slot: 0 - uuid: 31776962-477d-40ce-af72-f49a9f6f5dd4

# Status of sbd process on alp-2:
├─9128 sbd: watcher: /dev/sda5 - slot: 0 - uuid: 8c0e6bbd-d067-4d7e-9531-237da2490799
├─9129 sbd: watcher: /dev/sda6 - slot: 1 - uuid: 0a48b7a3-f9cc-46e6-89a4-e1f1215bdd14
├─9130 sbd: watcher: /dev/sda7 - slot: 1 - uuid: 31776962-477d-40ce-af72-f49a9f6f5dd4

# Watchdog info:
Node   |Device          |Driver    |Kernel Timeout
alp-1  |/dev/watchdog0  |iTCO_wdt  |10
alp-2  |/dev/watchdog0  |iTCO_wdt  |10

# Status of fence_sbd:
resource stonith-sbd is running on: alp-1

- Return immediately if no changes are made
- Adjust watchdog timeout and msgwait values properly
@liangxin1300 force-pushed the 20240614_crm_sbd_sublevel branch 2 times, most recently from 5a33305 to b9e9853 (November 29, 2024 07:08)
doc/crm.8.adoc (outdated):
...............
# For disk-based SBD
crm sbd configure show [disk_metadata|sysconfig|property]
crm sbd configure [device=<dev>]... [watchdog-device=<dev>] [watchdog-timeout=<integer>] [allocate-timeout=<integer>] [loop-timeout=<integer>] [msgwait-timeout=<integer>]
Contributor:

And I am confused by crm sbd configure watchdog-timeout=.... There are 3 similar items: Timeout (watchdog) :, SBD_WATCHDOG_TIMEOUT= and stonith-watchdog-timeout=. Which ones are expected to be modified by this command?

Two major scenarios:

  1. disk-based
    Timeout (watchdog) : in the disk metadata is used. SBD_WATCHDOG_TIMEOUT= is useless
  2. diskless
    SBD_WATCHDOG_TIMEOUT= and stonith-watchdog-timeout= are meant to be used by diskless-sbd only
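The distinction described in the two items above can be stated as a tiny dispatch. This is purely illustrative, not proposed code:

```python
def effective_watchdog_setting(disk_based: bool) -> str:
    """Which watchdog-timeout setting is actually consulted per SBD mode."""
    if disk_based:
        # Disk-based SBD reads "Timeout (watchdog)" from the on-disk
        # metadata; SBD_WATCHDOG_TIMEOUT is not consulted in this mode.
        return "disk metadata: Timeout (watchdog)"
    # Diskless SBD uses SBD_WATCHDOG_TIMEOUT together with the
    # stonith-watchdog-timeout cluster property.
    return "sysconfig: SBD_WATCHDOG_TIMEOUT"
```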

doc/crm.8.adoc (outdated):
...............
# For disk-based SBD
crm sbd configure show [disk_metadata|sysconfig|property]
crm sbd configure [device=<dev>]... [watchdog-device=<dev>] [watchdog-timeout=<integer>] [allocate-timeout=<integer>] [loop-timeout=<integer>] [msgwait-timeout=<integer>]
Contributor:

If Timeout (watchdog) : and SBD_WATCHDOG_TIMEOUT= control the same thing, and only one of them is effective, we should show only the effective one in crm sbd configure show, or indicate which one is effective in some way.

I kind of agree with you. Let's keep debating with Xin ;)

# To keep the order of devices during removal
left_device_list = [dev for dev in self.device_list_from_config if dev not in devices_to_remove]
if len(left_device_list) == 0:
    raise self.SyntaxError("Not allowed to remove all devices")
Contributor:

Suggested change:
- raise self.SyntaxError("Not allowed to remove all devices")
+ raise self.SyntaxError("Not allowed to remove all devices. Run `crm cluster init sbd -S` to bootstrap the diskless-sbd")

Intentionally not giving "-F" directly here, to have the user think twice. We can debate this.

    return False
if not sbd.SBDUtils.is_using_disk_based_sbd():
    logger.error("Only works for disk-based SBD")
    logger.info("Please use 'crm cluster init -s <dev1> [-s <dev2> [-s <dev3>]]' to configure disk-based SBD first")
Contributor:

Probably better to suggest using the SBD stage

Suggested change:
- logger.info("Please use 'crm cluster init -s <dev1> [-s <dev2> [-s <dev3>]]' to configure disk-based SBD first")
+ logger.info("Please use 'crm cluster init sbd -s <dev1> [-s <dev2>]' to configure the disk-based SBD first")

for node in self.cluster_nodes:
    out = self.cluster_shell.get_stdout_or_raise_error(scripts_in_shell, node)
    if out:
        print(f"# Status of sbd process on {node}:")
Contributor:

Suggested change:
- print(f"# Status of sbd process on {node}:")
+ print(f"# Status of the sbd disk watcher process on {node}:")

And this information should not be printed out for the diskless-sbd.

)
sbd_manager.init_and_deploy_sbd()

def _configure_diskless(self, parameter_dict: dict):
Contributor:

As with disk-based sbd, I expect to see TimeoutStartSec get updated along with crm sbd configure watchdog-timeout=50, for example.

It sounds like the diskless bootstrap code doesn't do this either.
